\subsubsection{ARM64}

\myparagraph{GCC}

Let's compile the example using GCC 4.8.1 for ARM64:

\lstinputlisting[numbers=left,label=hw_ARM64_GCC,caption=\NonOptimizing GCC 4.8.1 + objdump,style=customasmARM]{patterns/01_helloworld/ARM/hw.lst}

There are no Thumb and Thumb-2 modes in ARM64, only ARM mode, so all instructions are 32 bits long.
The register count is doubled: \myref{ARM64_GPRs}.
64-bit registers have the \TT{X-} prefix, while their lower 32-bit parts have the \TT{W-} prefix.

\myindex{ARM!\Instructions!STP}
The \TT{STP} instruction (\IT{Store Pair}) 
saves two registers in the stack simultaneously: \RegX{29} and \RegX{30}.

Of course, this instruction is able to save this pair at an arbitrary place in memory, 
but the \ac{SP} register is specified here, so the pair is saved in the stack.

ARM64 registers are 64 bits wide, i.e., 8 bytes each, so 16 bytes are needed to save two registers.

The exclamation mark (``!'') after the operand means that 16 is subtracted from \ac{SP} first, and only then
are the values of the register pair written to the stack.
This is called \IT{pre-index}.
For the difference between \IT{post-index} and \IT{pre-index},
see \myref{ARM_postindex_vs_preindex}.
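
To make the order of operations concrete, the pre-index behaviour can be modelled in C.
This is only a sketch of the semantics, not a description of how the \ac{CPU} is implemented:
the \TT{toy\_cpu} structure, the \TT{toy\_stp\_pre()} function and the 64-byte stack are hypothetical names
and sizes chosen for illustration.

\begin{lstlisting}[style=customc]
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: a small byte-addressed stack and an SP
   kept as an offset into it. These names are illustrative only. */
enum { TOY_STACK_SIZE = 64 };

struct toy_cpu {
    uint8_t  stack[TOY_STACK_SIZE];
    uint64_t sp;                    /* offset into stack[], grows downwards */
};

/* Models a pre-indexed store pair:
   SP is adjusted first, then both values are stored at the new SP. */
static void toy_stp_pre(struct toy_cpu *cpu, uint64_t a, uint64_t b, int64_t imm)
{
    cpu->sp += imm;                             /* write-back happens first */
    memcpy(&cpu->stack[cpu->sp],     &a, 8);    /* first register at [SP]   */
    memcpy(&cpu->stack[cpu->sp + 8], &b, 8);    /* second one at [SP+8]     */
}
\end{lstlisting}

Starting with \TT{sp} equal to 64 and calling \TT{toy\_stp\_pre()} with an immediate of $-16$
leaves \TT{sp} at 48, with the first value stored at offset 48 and the second at offset 56.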

Hence, in terms of the more familiar x86, the first instruction is analogous to the pair
\TT{PUSH X29} and \TT{PUSH X30}.
\RegX{29} is used as \ac{FP} in ARM64, and \RegX{30}
as \ac{LR}, which is why they are saved in the function prologue and restored in the function epilogue.

The second instruction copies \ac{SP} into \RegX{29} (or \ac{FP}).
This is done to set up the function stack frame.

\label{pointers_ADRP_and_ADD}
\myindex{ARM!\Instructions!ADRP/ADD pair}
\TT{ADRP} and \ADD instructions are used to load the
address of the string \q{Hello!} into the \RegX{0} register,
because the first function argument is passed
in this register.
No single ARM instruction can load an arbitrary large number into a register, because the instruction
length is limited to 4 bytes (read more about it here: \myref{ARM_big_constants_loading}).
So several instructions must be used. The first instruction (\TT{ADRP}) writes the address of the 4KiB page where the string is
located into \RegX{0},
and the second one (\ADD) adds the remainder, i.e., the offset within that page.
More about that in: \myref{ARM64_relocs}.

\TT{0x400000 + 0x648 = 0x400648}, and we see our \q{Hello!} C-string in the \TT{.rodata} data segment at this address.
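
The split of the address into a page part and a remainder can be checked with a few lines of C.
This is a minimal sketch, assuming 4KiB pages as stated above; the helper names \TT{page\_base()} and
\TT{page\_offset()} are hypothetical and only mirror the roles of \TT{ADRP} and \ADD here.

\begin{lstlisting}[style=customc]
#include <stdint.h>

/* With 4 KiB pages, the low 12 bits are the offset inside the page. */
#define LOW12_MASK 0xFFFull

/* The part ADRP is responsible for: the address of the 4 KiB page. */
static uint64_t page_base(uint64_t addr)
{
    return addr & ~LOW12_MASK;
}

/* The part the following ADD supplies: the offset within that page. */
static uint64_t page_offset(uint64_t addr)
{
    return addr & LOW12_MASK;
}
\end{lstlisting}

For the address \TT{0x400648}, \TT{page\_base()} yields \TT{0x400000} and \TT{page\_offset()} yields \TT{0x648},
which are exactly the two values added above.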

\myindex{ARM!\Instructions!BL}

\puts is called afterwards using the \TT{BL} instruction. This was already discussed: \myref{puts}.

\MOV writes 0 into \RegW{0}. 
\RegW{0} is the lower 32 bits of the 64-bit \RegX{0} register:

\input{ARM_X0_register}
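
The relation between the two register names can also be sketched in C.
The functions \TT{read\_w()} and \TT{write\_w()} below are hypothetical helpers, not part of any toolchain.
The second one relies on a property of ARM64 that is not visible in the listing: a write to a \TT{W-} register
also clears the upper 32 bits of the corresponding \TT{X-} register.

\begin{lstlisting}[style=customc]
#include <stdint.h>

/* Reading the W view: the lower 32 bits of the 64-bit register value. */
static uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing the W view: on ARM64 this also zeroes the upper 32 bits
   of the X register, so the old contents do not survive. */
static uint64_t write_w(uint64_t x_old, uint32_t value)
{
    (void)x_old;                /* previous value is discarded entirely */
    return (uint64_t)value;
}
\end{lstlisting}

Under this model, writing 0 through the 32-bit view leaves the whole 64-bit register equal to 0,
whatever it contained before.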

The function result is returned via \RegX{0} and \main returns 0, so that's how the return result is prepared.
But why use the 32-bit part?

Because the \Tint data type in ARM64, just like in x86-64, is still 32-bit, for better compatibility.

So if a function returns a 32-bit \Tint, only the lower 32 bits of \RegX{0} register have to be filled.

In order to verify this, let's change this example slightly and recompile it.
Now \main returns a 64-bit value:

\begin{lstlisting}[caption=\main returning a value of \TT{uint64\_t} type,style=customc]
#include <stdio.h>
#include <stdint.h>

uint64_t main()
{
        printf ("Hello!\n");
        return 0;
}
\end{lstlisting}

The result is the same, but this is what \MOV at that line looks like now:

\begin{lstlisting}[caption=\NonOptimizing GCC 4.8.1 + objdump]
  4005a4:       d2800000        mov     x0, #0x0      // #0
\end{lstlisting}

\myindex{ARM!\Instructions!LDP}

\INS{LDP} (\IT{Load Pair}) then restores the \RegX{29} and \RegX{30} registers.

There is no exclamation mark after the instruction: this implies that the values are first loaded from the stack,
and only then is \ac{SP} increased by 16.
This is called \IT{post-index}.
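
The post-index order can be modelled in C in the same spirit.
Again, this is only a sketch of the semantics under simplifying assumptions: \TT{toy\_state},
\TT{toy\_ldp\_post()} and the 64-byte memory are hypothetical names and sizes used for illustration.

\begin{lstlisting}[style=customc]
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: a small byte-addressed memory and an SP
   kept as an offset into it. These names are illustrative only. */
enum { TOY_MEM_SIZE = 64 };

struct toy_state {
    uint8_t  mem[TOY_MEM_SIZE];
    uint64_t sp;                    /* offset into mem[] */
};

/* Models a post-indexed load pair:
   both values are loaded from the current SP, then SP is adjusted. */
static void toy_ldp_post(struct toy_state *st, uint64_t *a, uint64_t *b, int64_t imm)
{
    memcpy(a, &st->mem[st->sp],     8);     /* first register from [SP]   */
    memcpy(b, &st->mem[st->sp + 8], 8);     /* second one from [SP+8]     */
    st->sp += imm;                          /* write-back happens last    */
}
\end{lstlisting}

With \TT{sp} equal to 48 and an immediate of 16, the two values are read from offsets 48 and 56,
and \TT{sp} ends up at 64, i.e., where it was before the matching pre-indexed store.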

\myindex{ARM!\Instructions!RET}
A new instruction appeared in ARM64: \RET.
It works just like \TT{BX LR}, except that a special \IT{hint} bit is added, informing the \ac{CPU}
that this is a return from a function rather than an ordinary jump, so it can be executed more efficiently.

Because the function is so simple, optimizing GCC generates exactly the same code.
